A Comparison of Two Item Selection Procedures for
Building Criterion-Referenced Tests

Within any form of systematic instruction (e.g., mastery learning),
there is a need for highly relevant achievement tests to monitor the
achievement of individual students. Such tests have been commonly known as
"criterion-referenced" (CR).

In the area of CR test reliability, two significantly distinctive
conceptualizations have been discussed (Hambleton, Swaminathan, Algina &
Coulson, 1978). The first refers to the consistency of correct pass or
fail classifications from test to test, while the second reflects the
magnitude of errors of measurement as it affects decisions regarding
pass-fail.



Both content validity and reliability are affected by the manner in
which CR tests are constructed. Essentially, test makers may develop domain
specifications or objectives, create items, review these items using logical
or empirical procedures, and select items for CR tests in much the manner
recommended currently by test specialists (e.g., Haladyna & Roid, 1981;
Hambleton et al., 1978). The way items are selected for a CR test is an
issue of major importance in CR test development and is the focus of this
study.






Two Approaches to CR Test Construction 

Random sampling. Classical test theory is based on the practice of
random sampling from a well-defined domain of test items (Lord & Novick,
1968; Nunnally, 1967). The very same approach to test construction is
present in generalizability theory (Cronbach, Gleser, Nanda & Rajaratnam,
1972). And the practice of sampling is prominent in many discussions of
CR testing (Brennan & Kane, 1977; Hambleton et al., 1978; Millman, 1974a,
1974b; Popham, 1978; Shoemaker, 1975).

Thus it seems desirable to randomly select items from a pool of
items which have been carefully developed to represent some important
instructional targets. In practice, however, we are aware that empirical
procedures are often used to select items once test blueprints and other
requirements have been satisfied (Mehrens & Ebel, 1979). Most measurement
textbooks give strong support to the use of the results of item analysis
for selecting or removing items from achievement tests. A recent study by
Haladyna and Roid (1979b), however, suggests that when item characteristic
indexes are used to select items for a CR test, the results lead to larger
errors of measurement when compared to tests composed by random sampling.
Therefore, there is some empirical support for the practice of randomly
sampling items.

Latent trait theory. Recent interest in latent trait theory has
resulted in a number of research studies and applications (e.g., Hambleton,
Swaminathan, Cook, Eignor & Gifford, 1978; Wright, 1977). There have been
several attempts to apply the simplest of these latent trait models, the
Rasch model, to CR testing (Haladyna & Roid, 1979a; Hambleton & Cook, 1977;
Rentz & Rentz, 1978). In the study by Haladyna and Roid (1979a), the Rasch
model seemed to be very robust in estimating student achievement despite
problems with the stability of estimation of the only parameter of the
model, item difficulty.

In theory, a test maker selects test items for students in such a
manner that the difficulty of the items is matched to the achievement level
of the student. When this is accomplished, the error of measurement for
[...]

[...] generalizability theory, and also have meaning in latent trait theory
(Lord & Novick, 1968, pp. 386-387). The definitions are also essentially
acceptable in discussions on CR testing (Hambleton, Swaminathan, Algina &
Coulson, 1978; Millman, 1974a).

1. An item universe is generated that adequately and logically
represents the target of instruction, and this universe can be considered
to be "unidimensional" in the sense that it represents a holistic trait.

2. A true score is the result obtained by administering all items
in the item universe to an examinee in the population of examinees for
which the test is intended.

3. An observed score is the result obtained by administering a subset
of these items to an examinee.

4. The observed score is an estimator of the true score and is
unbiased when the score is based on a random sample of items.

5. An error of measurement is the difference between a true and
an observed score.
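
The five definitions above can be illustrated with a small simulation. The
sketch below is not taken from the study: the 50-item universe, the
simulated response vector, and the function names are assumptions made only
for illustration, and the sign convention for the error (observed minus
true) is likewise a choice, since the definition does not fix it.

```python
import random

def true_score(responses):
    """Definition 2: proportion correct over the whole item universe."""
    return sum(responses) / len(responses)

def observed_score(responses, item_ids):
    """Definition 3: proportion correct on a subset of the universe."""
    return sum(responses[i] for i in item_ids) / len(item_ids)

# Hypothetical examinee: 1 = correct, 0 = incorrect on a 50-item universe.
rng = random.Random(0)
UNIVERSE_SIZE = 50
responses = [1 if rng.random() < 0.7 else 0 for _ in range(UNIVERSE_SIZE)]

# Definition 4: the observed score comes from a random sample of items.
sampled_items = rng.sample(range(UNIVERSE_SIZE), 10)

t = true_score(responses)
x = observed_score(responses, sampled_items)
error = x - t  # Definition 5 (sign convention assumed here)

# Unbiasedness: averaged over many random samples of items, the observed
# score should land very close to the true score.
replications = [
    observed_score(responses, rng.sample(range(UNIVERSE_SIZE), 10))
    for _ in range(20000)
]
mean_observed = sum(replications) / len(replications)
```

Any single 10-item sample can miss the true score by a fair margin; it is
only the average over repeated random samples that converges on it, which is
the sense in which definition 4 calls the observed score unbiased.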

It is very rare, if not nearly impossible, to obtain true scores.
Yet much progress has been made in specifying domains, to the extent
that finite item universes are specifiable and, in experimental conditions,
entire finite domains have been administered to samples of students
(Haladyna & Roid, [year illegible]). Thus, true scores may be directly
observed. Given an item-by-person matrix of responses to items where the
finite item universe has been administered, it is possible to systematically
construct tests of varying length using different test construction
strategies for the purpose of making comparisons in terms of error. That
is, we can use an item-by-person matrix to construct tests using random
sampling and latent trait procedures, and the simulated test results will
lead to reasonable estimates of the magnitude of errors of measurement
that arise from these two approaches to test construction.

Therefore, the independent variables of the study were:

1. Two methods of test construction: random sampling vs. selection
of items based on the match between student performance level and item
difficulty.

2. Four test lengths: 10, 20, 30, and 40 items.

3. Four types of CR test data varying in sensitivity to instruction.

The dependent measures of the study included the absolute average
deviation and a ratio of error variation to true score variation, two
statistics which represent the amount of measurement error present in
any set of test scores.

Sources of Data

Four item universes were administered to students prior to and
following instruction. These item universes vary widely in content,
educational level, and sensitivity to instruction. The first two data
sources contained items representing objectives which first-year dental
students were to learn as part of a course in dental anatomy. The second
two data sources were obtained from elementary school children as part of
an instructional program assessment. All of these tests were objective-based
and administered as part of instruction. Summary statistics for these CR
test data are presented in Table 1. As shown there, the instructional
sensitivity (pretest vs. posttest differences) of these tests varies widely,
from 18.4% to 56.3%. It is also important to note that these four data
sources differed in [...].

For each data source, posttest results were used, as this condition is
the one most prominently used in reliability and validity analyses in
practice. While pretest data are desirable for other reasons, such as item
analysis (Haladyna & Roid, 1981), they are expensive and difficult to
obtain, and collecting them is inefficient from the standpoint of usage of
student time.

Using the person-by-item matrix for each data source, three 10-, 20-,
30-, and 40-item samples were randomly drawn from the item universe to
simulate several forms of randomly composed tests of these varying lengths,
a total of 12 such tests. Each of these tests was then scored using
student responses to these particular items.
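
The random-sampling arm described above (three forms at each of four
lengths, scored from the person-by-item matrix) can be sketched as follows.
The function names, the seed, and the small simulated matrix are
hypothetical; the study's own sampling code is not given in the source.

```python
import random

LENGTHS = (10, 20, 30, 40)   # test lengths named in the text
FORMS_PER_LENGTH = 3         # three random forms per length, 12 in all

def draw_random_forms(universe_size, seed=0):
    """Return 12 lists of item indices sampled without replacement."""
    rng = random.Random(seed)
    forms = []
    for length in LENGTHS:
        for _ in range(FORMS_PER_LENGTH):
            forms.append(sorted(rng.sample(range(universe_size), length)))
    return forms

def score_form(matrix, form):
    """Percent-correct score of every person on one form.

    matrix[p][i] is 1 if person p answered item i correctly, else 0.
    """
    return [100.0 * sum(row[i] for i in form) / len(form) for row in matrix]

# Hypothetical 5-person by 60-item response matrix.
data_rng = random.Random(1)
matrix = [[int(data_rng.random() < 0.6) for _ in range(60)] for _ in range(5)]

forms = draw_random_forms(universe_size=60)
scores = [score_form(matrix, form) for form in forms]
```

Because every form is scored from the same matrix, each form's scores can be
compared directly with the true scores computed over all 60 items.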

The Rasch model is used to support the notion that when the difficulty
of a test is matched to the level of the examinee, the error of measurement
is minimized. Therefore three conditions can exist when an examinee
encounters a test: (a) the test is at-level and the error of measurement is
small, (b) the test is too difficult or too easy and the error of
measurement is large, or (c) the test is near the level of the examinee and
the error of measurement is moderate.
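
The three conditions can be made concrete with the standard Rasch result
that the standard error of an ability estimate is approximately one over the
square root of the test information, where each item contributes p(1 - p).
The sketch below is an illustration under assumed values: the ability of
0.0 logits, the three sets of item difficulties, and the shifts of 1.5 and
3.0 logits are invented, not taken from the study.

```python
import math

def rasch_probability(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def standard_error(theta, difficulties):
    """Approximate standard error of the ability estimate at theta."""
    information = sum(
        rasch_probability(theta, b) * (1.0 - rasch_probability(theta, b))
        for b in difficulties
    )
    return 1.0 / math.sqrt(information)

theta = 0.0  # hypothetical examinee ability in logits

# Three hypothetical 20-item forms.
at_level = [-0.5 + i / 19 for i in range(20)]   # centered on the examinee
near_level = [b + 1.5 for b in at_level]        # somewhat too hard
off_level = [b + 3.0 for b in at_level]         # far too hard

se_at = standard_error(theta, at_level)
se_near = standard_error(theta, near_level)
se_off = standard_error(theta, off_level)
```

Shifting the forms downward by the same amounts (too easy rather than too
hard) gives the same standard errors, because item information is symmetric
in the distance between ability and difficulty.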

In this study, all three conditions were simulated. This was
accomplished by building test forms which varied systematically in
difficulty and by subdividing the sample of students into four equal
quartiles. Certain combinations of test forms and student samples yielded
situations where the errors were predicted to be small, as well as
off-level tests where these errors were predicted to be the largest. Thus,
two kinds of comparisons were available: (a) between the two test
construction strategies, and (b) between at-level and off-level tests
within the Rasch, latent trait approach.

[Table 1. Summary statistics for the four data sources (columns include
source, n items, n, mean, and s.d.); table body not recoverable.]

Analysis of Data

The results of each of these 12 forms were then compared using a
statistic conceived by Hambleton, Hutten, and Swaminathan (1976) for such
comparisons, the Average Absolute Difference (AAD). This statistic is
useful in describing the average magnitude of errors of measurement when
the true scores are known. Hambleton et al. (1976) used AAD with simulated
data to compare several methods of estimating true scores.

AAD is highly dependent upon the scales being used. Since random
samples of items lead to percentage correct scales and the use of the
Rasch model leads to an entirely different scale, a scale-free statistic
(E/T) was created which was free of this dependency upon the scale but
indicated the degree of error extant in the data as a function of the true
score variance. This statistic was the ratio of AAD to the standard
deviation of true scores. E/T is similar to the signal-to-noise ratio
discussed by Brennan and Kane (1977) except this statistic is not based
[...]
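
The two criterion statistics, AAD and E/T, can be written out as a minimal
sketch. The score vectors are invented, and the use of the population
standard deviation (rather than the sample form) is an assumption, since the
source does not say which was used.

```python
from statistics import pstdev

def aad(true_scores, estimates):
    """Average Absolute Difference between true scores and estimates."""
    return sum(abs(t - e) for t, e in zip(true_scores, estimates)) / len(true_scores)

def e_over_t(true_scores, estimates):
    """E/T: AAD divided by the standard deviation of the true scores."""
    return aad(true_scores, estimates) / pstdev(true_scores)

# Hypothetical true scores and estimates on a percent-correct scale.
true_pct = [60.0, 70.0, 80.0, 90.0]
est_pct = [65.0, 66.0, 83.0, 88.0]

# The same scores expressed as proportions (a linear rescaling).
true_prop = [t / 100.0 for t in true_pct]
est_prop = [e / 100.0 for e in est_pct]

aad_pct = aad(true_pct, est_pct)       # depends on the scale
aad_prop = aad(true_prop, est_prop)    # 100 times smaller
et_pct = e_over_t(true_pct, est_pct)   # unchanged by the rescaling
et_prop = e_over_t(true_prop, est_prop)
```

The example only shows invariance under a linear change of scale. The Rasch
logit scale is not a linear transformation of percent correct, so E/T makes
the two approaches comparable in units of true-score spread rather than
numerically identical.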

[...] either too easy or too hard.

[...] the possibility that the underlying [...] of scores or type of
distribution was a factor in determining the degree of measurement error
in these [...]. In a study by Haladyna and Roid (1980), where errors of
classification were studied, a criterion level was [...]. Results of that
study indicated that type of distribution of scores was the major factor
in determining classification errors. This third factor in the design
consisted of categories of achievement, where the student sample was
divided into four groups based on their true scores, the first group being
the highest achieving group and the fourth being the lowest.

With each data source, there were only a small number of level tests
for test lengths of 30 or 40, so interactions were not considered part of
the design due to insufficient numbers of observations in some cells. The
variance from these interactions was pooled with residual variance and only
main effects were reported. Since the concern was for the contribution of
each main effect in explaining error variance, results were reported as the
proportion of variance accounted for by each main effect following a test
of statistical significance where alpha was set at .001.
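
The "proportion of variance accounted for" by a main effect can be sketched
as the effect's sum of squares divided by the total sum of squares. The
single-factor function and the E/T values below are hypothetical; the study
used a three-factor main-effects design with unequal cell sizes, and the
exact sums-of-squares method is not stated in the source.

```python
def proportion_of_variance(groups):
    """Between-level sum of squares over total sum of squares.

    groups maps each level of one factor to the list of values
    observed at that level.
    """
    values = [v for level_values in groups.values() for v in level_values]
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_between = sum(
        len(level_values) * (sum(level_values) / len(level_values) - grand_mean) ** 2
        for level_values in groups.values()
    )
    return ss_between / ss_total

# Hypothetical E/T values for test forms grouped by test length.
et_by_length = {
    10: [0.50, 0.54],
    20: [0.40, 0.44],
    30: [0.34, 0.38],
    40: [0.30, 0.34],
}
share = proportion_of_variance(et_by_length)
```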

Results and Discussion

The results of the analyses of variance for each of the four data
sources are reported in Table 3. All results are reported in percent of
accounted variance, as all main effects were highly statistically
significant (p < .001). Sample sizes, means, and standard deviations for
all factors and data sources appear in Table 4. Of the four data sources,
three proved to have sufficient conditions for the establishment of
at-level tests for each test length and sample condition. For the first
data set, where the sensitivity to instruction was greatest and where
posttest scores were uniformly high, no at-level tests existed for the
first three of four sample conditions studied; that is, the first three
quartile groups consistently scored over 90%, and at this level, no test
form proved sufficiently difficult for any of these samples to justify the
designation as an at-level test. The results for the first data set are
based on test scores for the fourth group only, which had a wide range of
achievement test scores (70-90%).

The results of this analysis of the sources of error variance can be
classified into three categories: (a) test construction technique, (b) test
length, and (c) type of sample condition. These become the objects of
further discussion.

Table 3

Percent of Accounted Variance for Each Main Effect

                        Data Source 1   Data Source 2   Data Source 3   Data Source 4
Type of Test            51.[?]%         13.7%           12.9%           14.8%
Test Length             40.3%           23.5%           33.1%           51.1%
Type of Sample          --              50.0%           [?]0.8%         31.8%
Total Proportion of
Accounted Variance      --              87.8%           96.8%           97.7%

Table 4

Sample Size, Means, and Standard Deviation for
Each Main Effect and Data Source

                        Data Source 1        Data Source 2        Data Source 3        Data Source 4
                        n    mean   s.d.     n    mean   s.d.     n    mean   s.d.     n    mean   s.d.
Type of Test
  1. Random Sample      12   0.93   0.41     [?]  [?].22 1.86     48   1.77   1.26     48   1.24   0.6[?]
  2. At-Level           5    0.83   0.30     23   2.54   1.59     30   1.54   1.09     18   1.18   0.73
  3. Near-Level         2    1.02   0.04     30   3.49   1.69     21   2.17   1.16     22   1.84   0.98
  4. Off-Level          10   1.54   0.19     33   4.53   1.91     13   3.13   1.74     24   1.82   0.61
Test Length
  1. 10 items           11   [?]    [?]      52   4.54   1.94     40   [?]    1.54     40   2.1[?] 0.75
  2. 20 items           7    [?]    [?]      32   3.38   1.69     28   [?]    1.03     28   1.45   0.48
  3. 30 items           6    [?]    [?]      28   2.80   1.58     24   [?]    0.62     24   1.01   0.39
  4. 40 items           5    [?]    [?]      28   2.20   1.16     20   [?]    0.35     20   0.72   0.25
Type of Sample
  1. First Quartile     --   --     --       35   2.51   0.78     28   1.03   0.4[?]   28   1.15   0.48
  2. Second Quartile    --   --     --       35   5.28   1.90     28   2.56   1.29     28   1.82   0.78
  3. Third Quartile     --   --     --       35   4.13   1.58     28   3.00   1.51     28   1.98   0.8[?]
  4. Fourth Quartile    --   --     --       35   1.87   0.75     28   1.19   0.48     28   0.94   0.40
Total                   29   [?]    [?]      140  3.46   1.90     112  1.94   1.33     112  1.47   0.78

Test Construction Approach 

For the latter three data sources, where the type of sample was not
a problem, the approach to test construction typically accounted for a
relatively small but highly statistically significant proportion of
variance. In each and every data sample, the at-level tests consistently
produced the smallest errors of measurement.

The criterion of effect size was used here to describe the magnitude
of the differences observed. Effect size is simply the number of standard
deviation units by which two means differ. The differences between
Rasch-based, at-level tests and randomly generated tests represented small
effect sizes: .23, .36, and .08, respectively. While these effect sizes are
small, corresponding to the proportion of accounted variance shown in
Table 3, the results clearly demonstrate that when the difficulty of the
tests is appropriate to the level of achievement of a particular sample,
the errors of measurement are distinctly and consistently smaller.
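
The effect-size criterion defined above amounts to a standardized mean
difference. The sketch below uses invented numbers, and the choice of
standard deviation in the denominator is an assumption, since the source
does not state which one was used.

```python
def effect_size(mean_a, mean_b, sd):
    """Number of standard deviation units by which two means differ."""
    return (mean_a - mean_b) / sd

# Hypothetical mean errors for random forms and at-level forms,
# with a hypothetical overall standard deviation.
random_mean = 2.0
at_level_mean = 1.7
overall_sd = 1.5

es = effect_size(random_mean, at_level_mean, overall_sd)
```

With these made-up values the random forms carry errors about 0.2 standard
deviations larger than the at-level forms, which would count as a small
effect in the sense used in the text.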

Looking at tests that were judged to be near-level, errors of
measurement were consistently higher than the at-level test results. The
magnitudes of these effects were .50, .47, and .85. Further, these means
were higher than those reported for tests where items were randomly chosen.
These results should indicate that the procedure for identifying level
tests was valid and that near-level tests have considerably more errors of
measurement than randomly generated tests as well as at-level tests. As
anticipated, off-level tests were considerably error-ridden in contrast to
other conditions. The one exception to this, data source four, was due to a
large amount of instability in [?]0-item test forms for the second and
third quartiles.

The first level of analysis establishes the validity of constructing
achievement tests which match the level of achievement of the student.
Randomly selecting items, as is advocated in classical test theory,
generalizability theory, and other approaches (CR testing) where an item
domain is believed to represent the object of instruction, does not produce
the best tests in terms of minimizing errors of measurement. On the other
hand, Rasch-based tests do. A finer level of analysis was conducted to
ascertain the bias of error in estimating student scores as a function of
the degree to which a test matched the achievement level of the examinees.

An examination of the AADs (the mean difference of true and observed
scores) across each condition revealed that a systematic bias did occur
as a function of the difference between the level of the test and the level
of the examinees. When the test form was significantly too easy, student
observed scores tended to be higher than true scores. When the test form
was significantly too hard, student observed scores tended to be lower than
true scores.

This is a reasonable finding. The Rasch model yields domain score
estimates that are higher when the group of items upon which the estimate
is based is easy. Conversely, domain score estimates are deflated when
the set of items is hard relative to the student's achievement.

Clearly, for high achieving students, hard tests do more harm than
good. On the other hand, a student who takes an easier test is more likely
to be overrated, because the mismatch between the student's level and the
difficulty of the items yields an overestimation of student achievement. In
either case, the results are larger errors of measurement which are the
products of a test of inappropriate difficulty. The results over all four
data sets show this to be consistently true.

Test Length

It was expected that errors of measurement would be greatly affected
by test length. While this is theoretically predicted, the design of
this study permitted a look at the magnitude of decreases in errors of
measurement as a function of test length.

These results, reported in Tables 3 and 4, indicate that test length
was a very significant factor, accounting for 23.4%, 33.2%, and 50.4% of
the variance in three of the four data sets. In the first data set, where
the distribution of scores was badly skewed, test length accounted for
40.3% of the variance. Thus it is clear that test length is a powerful
factor in reducing measurement error.

The results allow us to examine the magnitude of decrease in measurement
errors as a function of test length. These are briefly summarized below
in terms of effect size.

              From 10 to 20 items   From 20 to 30 items   From 30 to 40 items
Data Set 1    [?]                   .54                   .32
Data Set 2    .79                   .53                   .32
Data Set 3    .72                   .38                   .42
Data Set 4    .90                   .57                   .38

A large effect size indicates a substantial reduction in measurement error
from one test length to the next test length. From the results summarized
above, it is clear that 20-item tests offer the largest increase in
precision from 10-item tests and the increase between 20-item and 30-item
tests is also substantial, while the increase in precision between 30-item
and 40-item tests is smallest in three of the four data sets. While it is
clear that 40-item tests yield the best estimates of true scores, as might
be expected, 30- and 20-item tests are not that substantially inferior.
In terms of overall test proficiency, these results would suggest that
20-item tests offer the most for the least, while gains made with longer
tests are less substantial. Where one draws the line with respect to the
number of test items is a matter of the consequences one places on making
decision errors in systematic instruction (Haladyna & Roid, 1980).
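
The diminishing return from each added block of ten items is what a simple
binomial error model would also predict. The sketch below is a textbook
illustration, not the study's analysis: it assumes items are randomly
sampled and uses a hypothetical true proportion correct of 0.7.

```python
import math

def binomial_standard_error(p, n_items):
    """Standard error of a proportion-correct score on n randomly sampled items."""
    return math.sqrt(p * (1.0 - p) / n_items)

p_true = 0.7                      # hypothetical true proportion correct
lengths = [10, 20, 30, 40]
errors = [binomial_standard_error(p_true, n) for n in lengths]

# Drop in standard error for each step: 10->20, 20->30, 30->40.
reductions = [errors[i] - errors[i + 1] for i in range(len(errors) - 1)]
```

Because the standard error shrinks with the square root of test length, the
first step from 10 to 20 items buys the largest reduction and each later
step buys less, in line with the pattern summarized above for most of the
data sets.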

Type of Sample

The third factor of the study was the type of sample (range of
examinees). As noted earlier, each group of students was divided into
quartiles representing four sample conditions: high, high middle, low
middle, and low.

Results in Table 3 would indicate that type of sample was a significant
factor in determining errors. However, it must be made clear that the
criterion for this analysis was the statistic E/T. As noted previously,
this ratio is scale-independent. The results of Table 4 indicate that E/T
is highest for the two middle quartiles, where student scores varied the
least.

A more useful criterion is AAD, which is based on the difference between
true and observed scores. While E/T is metric-free, it is affected by the
distribution of true scores. AAD is not metric-free but is not affected
by the distribution. Therefore, AAD was used to ascertain the amount of
error extant in the data sets as a function of the four types of samples
studied. Since at-level tests were the most precise in estimating student
scores, these tests were studied across the three data sets where
the four sample conditions existed, using a one-way analysis of variance
with AAD as the dependent measure.

The results of this analysis revealed no differences as a function
of sample type (F = 0.34; df = 3, 73; p = .80). The means for the four
respective sample conditions were .306, .333, .338, and .343, with an
overall standard deviation of .135. It was conclusive from these results
that when at-level tests are employed to estimate domain scores, errors of
measurement do not vary significantly with the type of sample condition.
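
The one-way analysis of variance reported above compares the spread of the
four group means with the spread inside the groups. The sketch below
computes the F ratio and its degrees of freedom from scratch; the data are
invented, and only the arithmetic is meant to mirror the reported analysis
(df = 3, 73 corresponds to 4 groups and 77 at-level test forms).

```python
def one_way_f(groups):
    """Return (F, df_between, df_within) for a list of groups of values."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    df_between = k - 1
    df_within = n_total - k
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within

# Hypothetical AAD values for at-level forms in four quartile groups.
aad_by_quartile = [
    [0.30, 0.32, 0.28],
    [0.33, 0.35, 0.31],
    [0.34, 0.30, 0.38],
    [0.36, 0.32, 0.34],
]
f_ratio, df_between, df_within = one_way_f(aad_by_quartile)
```

An F ratio well below 1, as in the reported result, means the group means
differ by less than would be expected from the within-group variation alone.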

Conclusions

Test Construction Approach

The main objective of the study was to determine if a difference
existed in the magnitude of measurement errors of tests constructed two
different ways. The results were consistent across four data sets which
represented varying degrees of sensitivity to instruction. Tests created
by selecting appropriate difficulty levels for students based on the Rasch
model yielded smaller errors of measurement than tests which were created
by randomly sampling items. These results offer support for the concept
of latent trait theory as a basis for test construction and the practice
of providing achievement tests at the functioning level of each student
rather than the level of the heterogeneous group of students of which a
student is a member.

The results also suggest that random sampling of items is a second-best
alternative. The difference between the randomly sampled tests and the
Rasch-calibrated tests was not large in terms of the criterion of effect
size. Nonetheless, there was a statistically significant difference in
each instance.



The study also serves to show that when students receive tests that
are not at their level of functioning, errors of measurement tend to be
substantially higher than with either randomly sampled tests or at-level
tests. Thus the practice of level testing, if the assignment of students
to levels is done subjectively by human judgment, is indeed a delicate
technique to employ in school assessments. When a test is appropriate
to examinees, this study has served to show that domain scores are
precisely estimated. When the test is not appropriate for examinees, errors
are quite substantial.

The CR test developer is wise to understand the benefits and deficits
of these two test construction strategies, both of which require item pools.
Random sampling is a more conservative practice which guarantees a moderate
but controllable amount of measurement error. Level testing provides a
chance for superior precision at the expense of the chanciness when a
student encounters a test that is too hard or too easy. In this respect, the
Portland (Oregon) Public Schools, where such level tests are employed, use
a placement test as a form of pretest, which aims the student at the test of
appropriate level. This seems to be a sensible approach, which is now
grounded in research findings that support the practice.

Test Length

It is well known that test length is a powerful determinant of
reliability and measurement error. This study not only provided support for
this principle but indicates that errors of measurement are not evenly a
function of test length. If anything, the relationship between measurement
error and test length is a curvilinear function, with the greatest decrease
in measurement error occurring between 10- and 20-item tests and decreasing
as tests reach lengths of 40 items.

As Hambleton (1979) among others has noted, one goal in CR testing
is to arrive at reliable domain score estimates without unnecessarily
long tests. The results of this study would suggest that test lengths of
less than 20 would probably not lead to reasonable domain score estimates,
but satisfactory precision can be achieved for test lengths of 20 to 30
items. Beyond 30 items, gains in precision are offset by the longer tests.
This, however, is a rather subjective conclusion. One needs to set test
lengths based on considerations of time allocated for testing, number of
students who are likely to be classified as fail or in need of remedial
instruction, and other considerations. Precision is only one of several
factors that are used to determine the test length of a CR test.

It would be interesting and important to develop firmer guidelines
regarding the relationship between the two. More importantly, guidelines
for test length should be grounded in theory and be empirically tested
to ascertain their effectiveness. How long to make a CR test is still a
problem of concern.

Sample Type

It was clear from this study and from principles of latent trait theory
that errors of measurement vary as a function of the discrepancy between
the student and the test. If a test is too hard or too easy, there is a
bias in domain score estimation, and this bias is manifested in large
errors of measurement. Despite the fact that four disparate sample
conditions were employed, representing quartiles of the distribution of all
examinees, no differences were found in the AADs of these sample types.
They were remarkably stable across the four sample types studied. While
bias exists in domain score estimation as a result of an inappropriate
level of test, it does not exist for groups of students who differ in
achievement as long as the test they are given is appropriate to that
level.

While this study provides strong support for the practice of building
Rasch-based tests of varying degrees of difficulty to minimize errors
of measurement and to achieve reliable domain score estimates, a technology
for developing and using these tests in objective-based instructional
programs is just emerging and requires more empirical studies which examine
aspects of test construction which directly affect domain score estimation.

One of these aspects is item analysis, particularly the stability
of difficulty estimates. Haladyna and Roid (1979a) have shown that
difficulty estimates obtained from different samples differ substantially,
a result which Slinde and Linn (1978) observed in their study of
norm-referenced tests.

In summary, this study has proven that latent trait theory, particularly
the one-parameter Rasch model, has much to offer users of CR tests in
precisely estimating achievement with respect to a well-defined content
domain. Since domain estimation is a goal of CR measurement, the latent
trait approach to CR testing holds much promise.

Appendix A

A Procedure for Assigning Tests to One of Three Categories:
(a) At-level, (b) Near-level, and (c) Out-of-level

In this study, tests of varying lengths were systematically
constructed using difficulty levels as the basis for item selection. The
goal was to construct tests which varied in difficulty. Four different
samples were used. Each sample was created by subdividing the population
of examinees into four equal quartiles, each quartile representing a
different level of achievement.

A problem remained as to identifying the appropriateness of the
interaction between any test form and the level of achievement of that
sample. For any sample, a test form could be appropriate to the level
of examinees (+), or it could be nearly appropriate (01), or it could be
inappropriate, that is, too hard or too easy (02). The following
procedures were developed in this study to ascertain which of the three
conditions described above (+, 01, or 02) existed with each test form
generated in this study.

The procedures were based on an analysis of the median and range of
true scores of examinees in each quartile as well as the optimal range of
test scores for a particular test. The optimal range for any test form
was determined to be the range of scores for which the standard error of
estimate is minimal. This range is symmetrical around the center of the
scale; the size of the range was plus or minus 20 percentage points from
this midpoint of the scale. For example, in a 30-item test, the optimal
range was the Rasch logits equivalent to the range of scores from 30% to
70% on the 30-item scale (raw score 9 to 21).
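
The optimal-range rule can be sketched as below. The raw-score part follows
the text directly (plus or minus 20 percentage points around 50%). The
conversion to logits is an assumption made for illustration: it treats all
items on the form as having the same difficulty, in which case the Rasch
ability estimate for a proportion correct p is the form difficulty plus
ln(p / (1 - p)). The study's actual calibration tables are not given in the
source, and the function names are hypothetical.

```python
import math

def optimal_raw_range(n_items, half_width_pct=20):
    """Raw scores at 50% minus and plus half_width_pct of the scale."""
    low = round(n_items * (50 - half_width_pct) / 100)
    high = round(n_items * (50 + half_width_pct) / 100)
    return low, high

def logit(p):
    """Log-odds of a proportion."""
    return math.log(p / (1.0 - p))

def optimal_logit_range(form_difficulty, half_width_pct=20):
    """Logit equivalents of the optimal range, assuming equal-difficulty items."""
    low_p = (50 - half_width_pct) / 100
    high_p = (50 + half_width_pct) / 100
    return form_difficulty + logit(low_p), form_difficulty + logit(high_p)

raw_30 = optimal_raw_range(30)          # the 30-item example from the text
raw_20 = optimal_raw_range(20)
low_logit, high_logit = optimal_logit_range(form_difficulty=0.0)
```

Under the equal-difficulty assumption the optimal range is symmetric around
the form's difficulty and spans about 0.85 logits on either side; a form
would then be a candidate for the at-level designation when a quartile's
true scores fall largely inside that interval.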



To illustrate this procedure, a 20-item test from the first data
source was used, using the fourth quartile for this analysis of the test.
For the 20-item test the median was -1.5[?] and the range was -2.24 to
-0.9[?]. The median for the students in the fourth quartile was 1.88 and
the range was 0.78 to 2.38. Obviously there was no commonality between
the two respective medians and ranges, and the 20-item test form was
designated 02, off-level. Where a good match between the median and
optimal range of a test form and the median and range of true scores
existed, the designation +, at-level, was given. When there was a close
match, the designation was 01, near-level.

This procedure was applied to all four data sources to arrive at
assignments of test forms. Validity of this procedure was evident in
the results of the study. It was predicted that at-level tests would
have appreciably lower AAD and E/T than near-level and off-level tests.
This prediction was confirmed in all four data sets.

The results of the application of this procedure to the four data
sets are given in Tables 5, 6, 7, and 8.